Image Size Filtering for PhotoRec #187
I tried various ways to buffer recovered files in memory instead of writing them to disk, and it was tough to pull off in PhotoRec. The catch is that it requires around 2GB of memory, and how much memory you provide can affect whether some files get discovered at all, which is a big problem. In other branches I tried hard to centralize the whole lifecycle in one place (open the file, write to it, close it), but that is almost impossible because the code is very procedural. For instance, there are dozens of places that test whether a file handle exists, so it is very hard to change this without breaking something and missing files that would normally be found. What would really help is probably a big redesign of the entire code base, or something like replacing fwrite/fclose with user-defined functions that hold the files in memory rather than on disk. I could look into this more some other time. It is also worth noting that buffering will not reduce the search time: the major bottleneck is reading the source disk/image, not writing the recovered files.
Fixes #167
Image Size Filtering for PhotoRec
This PR implements filtering of recovered image files by dimensions and file size, addressing the requirement to skip thumbnail-sized images during recovery. The feature currently supports JPG and PNG formats with memory-efficient buffering.
Problems Addressed
Excessive I/O for small files: PhotoRec's original architecture opens a file handle for every detected file signature, writes data to disk, then evaluates filters post-recovery. For recoveries with thousands of thumbnails (10-50KB JPG/PNG files), this meant:
No dimension-based filtering: There was no option to filter recovered images by dimensions (width, height, resolution) or by file size.
Solution
Pre-save filtering with memory buffering: To filter images without wasting I/O, PhotoRec needs to know both dimensions AND file size before creating files on disk. But this creates a problem: dimensions are in the image header (first few hundred bytes), while actual file size requires finding the end-of-file marker.
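To illustrate why dimensions are available so early, here is a minimal sketch for PNG only. It relies on the PNG layout (an 8-byte signature followed by the IHDR chunk, whose first two fields are width and height as big-endian 32-bit integers). The helper names `read_be32()` and `png_peek_dimensions()` are hypothetical and are not taken from this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit big-endian integer, as used throughout the PNG format. */
static uint32_t read_be32(const unsigned char *p)
{
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper (not from the PR): extract width and height from the
 * first 24 bytes of a PNG stream.
 * Returns 1 on success, 0 if the buffer is too short or does not start with
 * the PNG signature followed by an IHDR chunk. */
static int png_peek_dimensions(const unsigned char *buf, size_t len,
                               uint32_t *width, uint32_t *height)
{
  static const unsigned char sig[8] =
    { 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n' };

  assert(width != NULL && height != NULL);
  if (buf == NULL || len < 24)
    return 0;
  /* Bytes 8..11 hold the chunk length, bytes 12..15 the chunk type. */
  if (memcmp(buf, sig, 8) != 0 || memcmp(buf + 12, "IHDR", 4) != 0)
    return 0;
  *width = read_be32(buf + 16);
  *height = read_be32(buf + 20);
  return 1;
}
```

The file size, by contrast, cannot be read from these 24 bytes; it only becomes known once the end of the stream has been located.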
The solution uses memory buffering combined with a new `file_check_presave()` callback.
Instead of writing files to disk immediately:
This eliminates wasted disk I/O for rejected images entirely. The `file_check_presave()` callback operates on the memory buffer, where both dimensions and file size are known, allowing a complete filtering decision before any disk writes.
Core Changes
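The buffer-then-decide flow described under Solution can be sketched roughly as follows. This is not the code from `src/filegen.c`: the `struct mem_file` type, `mem_file_append()`, `mem_file_finish()` and the example predicate are all hypothetical names, and the callback signature is an assumption made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory staging buffer for one recovered file. */
struct mem_file {
  unsigned char *data;
  size_t size;
  size_t capacity;
};

/* Assumed shape of a pre-save check: non-zero means "keep the file". */
typedef int (*presave_check_fn)(const unsigned char *data, size_t size);

/* Append a block to the staging buffer, growing it as needed.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int mem_file_append(struct mem_file *mf, const void *block, size_t len)
{
  assert(mf != NULL);
  if (len == 0)
    return 0;
  if (len > mf->capacity - mf->size) {
    size_t new_cap = mf->capacity ? mf->capacity : 4096;
    while (new_cap - mf->size < len)
      new_cap *= 2;               /* overflow handling omitted in this sketch */
    unsigned char *p = realloc(mf->data, new_cap);
    if (p == NULL)
      return -1;
    mf->data = p;
    mf->capacity = new_cap;
  }
  memcpy(mf->data + mf->size, block, len);
  mf->size += len;
  return 0;
}

/* Run the check on the complete buffer and touch the disk only if it passes.
 * Returns 1 if written, 0 if rejected, -1 on I/O error.
 * The buffer is always released. */
static int mem_file_finish(struct mem_file *mf, presave_check_fn keep,
                           const char *path)
{
  int result = 0;
  assert(mf != NULL && path != NULL);
  if (keep == NULL || keep(mf->data, mf->size)) {
    FILE *out = fopen(path, "wb");
    if (out == NULL) {
      result = -1;
    } else {
      size_t written = fwrite(mf->data, 1, mf->size, out);
      int closed = fclose(out);
      result = (written == mf->size && closed == 0) ? 1 : -1;
    }
  }
  free(mf->data);
  mf->data = NULL;
  mf->size = mf->capacity = 0;
  return result;
}

/* Example predicate: reject anything smaller than 16 bytes. */
static int keep_at_least_16_bytes(const unsigned char *data, size_t size)
{
  (void)data;
  return size >= 16;
}
```

In this sketch a rejected file never reaches `fopen()`, which is the property the PR description relies on.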
New filtering module (`src/image_filter.c`, `src/image_filter.h`):
- … `307200`) and dimension format (`640x480`)
- … `800-1920` or `-1080` for "no min, max 1080")
File format handlers (`src/file_jpg.c`, `src/file_png.c`):
- `file_check_presave()` callback that evaluates filters on recovered file data (from memory buffer if buffering is active, or from initial read buffer otherwise)
- `is_image=1` flag in file_hint structures to enable memory buffering for these formats
To enable image filtering for other formats, modify the file format handler (`file_*.c`) to:
- `is_image=1` in the `file_hint_t` structure
- `file_check_presave()` callback that:
  - `should_skip_image_by_dimensions()` and `should_skip_image_by_filesize()` from `image_filter.h`
- `header_check_*()` function:
  - `file_recovery_new->file_check_presave = &your_presave_callback`
  - `file_recovery_new->image_filter = file_recovery->image_filter`
See `file_jpg.c:jpg_maches_image_filtering()` and `file_png.c:png_maches_image_filtering()` for reference implementations.
Memory buffering (`src/filegen.c`):
- `calloc()` instead of `malloc()` to avoid immediate physical memory allocation
ncurses UI (`src/phrecn.c`):
CLI interface (`src/phcli.c`, `/cmd` batch mode):
- `imagesize,size,MIN-MAX,width,MIN-MAX,height,MIN-MAX,pixels,MIN-MAX`
- `100k`, `1.5m`, `2g` (kilobytes/megabytes/gigabytes)
- `800-1920` (range), `800-` (min only), `-1080` (max only)
- `pixels,307200-2073600` (direct pixel values)
- `pixels,640x480-1920x1080` (width×height, auto-multiplied to pixel count)
- `imagesize,size,100k-,width,800-,height,600-` (min 100KB, min 800×600)
- `imagesize,pixels,640x480-` (min 640×480 resolution = 307200 pixels)
Session persistence (`src/sessionp.c`):
Testing
Python test suite available at https://gist.github.com/piotrkochan/1eb15d8ecb85c866e716bd07ee48d203
The test script automates validation by running PhotoRec against a disk image with various filter configurations, then verifying that recovered files match the specified criteria using ImageMagick's `identify` command. It tests file size filtering with min/max/range values and unit notation (k/m/g), dimension filtering for width and height with various boundary conditions, and resolution filtering in both pixel count and WIDTHxHEIGHT format. Combined filters with multiple parameters active simultaneously are also tested. The script performs automatic baseline analysis using percentile calculations to generate realistic test ranges based on actual recovered content.
Future Work
This implementation is designed for extensibility:
- … `image_filter.c` for easy addition of other image formats (GIF, BMP, TIFF, WebP, etc.)
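For anyone extending the filter syntax, the `MIN-MAX` forms listed under CLI interface (`800-1920`, `800-`, `-1080`, `100k-`, `640x480-1920x1080`) could be parsed along these lines. This is a sketch under assumptions, not the parser in `src/image_filter.c`: `struct filter_range`, `parse_bound()` and `parse_range()` are hypothetical names, the k/m/g suffixes are assumed to be powers of 1024, and a value without a `-` is treated as invalid.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct filter_range {   /* hypothetical type, not taken from the PR */
  uint64_t min;         /* 0 when no lower bound is given */
  uint64_t max;         /* UINT64_MAX when no upper bound is given */
};

/* Parse one bound: "800", "1.5m" or "640x480" (multiplied to a pixel count). */
static int parse_bound(const char *s, size_t len, uint64_t *out)
{
  char tmp[32];
  char *end;
  char *x;
  double v;

  if (len == 0 || len >= sizeof(tmp))
    return 0;
  memcpy(tmp, s, len);
  tmp[len] = '\0';
  if (!isdigit((unsigned char)tmp[0]))
    return 0;

  x = strchr(tmp, 'x');
  if (x != NULL) {                       /* WIDTHxHEIGHT form */
    unsigned long long w = strtoull(tmp, &end, 10);
    unsigned long long h;
    if (end != x || !isdigit((unsigned char)x[1]))
      return 0;
    h = strtoull(x + 1, &end, 10);
    if (*end != '\0' || (h != 0 && w > UINT64_MAX / h))
      return 0;
    *out = (uint64_t)(w * h);
    return 1;
  }

  v = strtod(tmp, &end);                 /* plain number or size with suffix */
  switch (tolower((unsigned char)*end)) {
  case 'k': v *= 1024.0; end++; break;                    /* assumption: 1024 */
  case 'm': v *= 1024.0 * 1024.0; end++; break;
  case 'g': v *= 1024.0 * 1024.0 * 1024.0; end++; break;
  default: break;
  }
  if (*end != '\0' || !(v >= 0.0) || v >= 1.8e19)
    return 0;
  *out = (uint64_t)v;
  return 1;
}

/* Parse "MIN-MAX", "MIN-" or "-MAX". Returns 1 on success, 0 on bad input. */
static int parse_range(const char *spec, struct filter_range *r)
{
  const char *dash;

  assert(spec != NULL && r != NULL);
  dash = strchr(spec, '-');
  if (dash == NULL || (dash == spec && dash[1] == '\0'))
    return 0;
  r->min = 0;
  r->max = UINT64_MAX;
  if (dash != spec && !parse_bound(spec, (size_t)(dash - spec), &r->min))
    return 0;
  if (dash[1] != '\0' && !parse_bound(dash + 1, strlen(dash + 1), &r->max))
    return 0;
  return r->min <= r->max;
}
```

Because every bound is non-negative, the first `-` can always be read as the separator, which keeps the min-only and max-only forms unambiguous.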